WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora

Authors

  • Miguel Callón
  • Jorge Fdez-Glez
  • David Ruano-Ordás
  • Rosalía Laza
  • Reyes Pavón
  • Florentino Fernández Riverola
  • José Ramon Méndez
Abstract

In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool for building scientific datasets that facilitate experimentation in web spam research. The application lets the user specify multiple criteria that change how new corpora are generated, while reducing the number of repetitive and error-prone tasks involved in maintaining existing corpora. To this end, WARCProcessor supports up to six data sources commonly used in web spam research and can store the output corpus in the standard WARC format together with complementary metadata files. Additionally, the application supports the automatic and concurrent download of web sites from the Internet, and allows the user to configure the depth to which links are followed as well as the behaviour when redirected URLs appear. WARCProcessor provides both an interactive GUI and a command-line utility that can be run in the background.
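The configurable link depth and redirect behaviour mentioned in the abstract can be illustrated with a minimal sketch. This is not WARCProcessor's implementation: the `crawl` function, its `max_depth` and `follow_redirects` parameters, and the in-memory `FAKE_WEB` mapping are all hypothetical names chosen for illustration, and the traversal is sequential rather than concurrent so that it runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web" used instead of real HTTP requests.
# Each URL maps to ("page", [outgoing links]) or ("redirect", target URL).
FAKE_WEB = {
    "http://a.example/": ("page", ["http://a.example/1", "http://b.example/"]),
    "http://a.example/1": ("page", ["http://a.example/2"]),
    "http://a.example/2": ("page", []),
    "http://b.example/": ("redirect", "http://c.example/"),
    "http://c.example/": ("page", []),
}


def crawl(seed, max_depth, follow_redirects=True, web=FAKE_WEB):
    """Breadth-first traversal; returns the URLs of pages that would be stored."""
    stored = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        kind, payload = web.get(url, ("missing", None))
        if kind == "redirect":
            # Assumption of this sketch: a redirect does not consume link depth.
            if follow_redirects and payload not in seen:
                seen.add(payload)
                queue.append((payload, depth))
            continue
        if kind != "page":
            continue  # unknown URL: nothing to store
        stored.append(url)
        if depth < max_depth:
            for link in payload:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return stored


if __name__ == "__main__":
    print(crawl("http://a.example/", max_depth=1))
    print(crawl("http://a.example/", max_depth=1, follow_redirects=False))
```

In this sketch, raising `max_depth` widens the set of pages collected from a seed site, while `follow_redirects=False` drops any page that is reachable only through a redirect. A real tool would additionally fetch pages over HTTP, run downloads concurrently, and write each response as a WARC record.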

Journal title:

Volume 18, Issue

Pages -

Publication date: 2017